Imports needed for the notebook. Uses Python 2.7.13, numpy 1.11.3, pandas 0.19.2, sklearn 0.18.1, scipy 0.18.1, matplotlib 2.0.0.
In [121]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
Read in the data; it is from the UCI Wholesale Customers dataset at https://archive.ics.uci.edu/ml/datasets/Wholesale+customers.
In [122]:
df = pd.read_csv('Wholesale customers data.csv')
Create a feature for total customer size. Note: 'Delicassen' misspelled in original data file.
In [123]:
df['Total'] = df['Fresh'] + df['Milk'] + df['Grocery'] + df['Frozen'] + df['Detergents_Paper'] + df['Delicassen']
print df.head()
Add a function to convert a categorical column to dummy variables and join them to the dataframe. If the source df is the same as the destination df, remember to delete the original categorical variable afterwards.
In [124]:
def get_dummies(source_df, dest_df, col):
    # Build dummy columns for one categorical feature and report level counts.
    dummies = pd.get_dummies(source_df[col], prefix=col)
    print 'Quantities for %s column' % col
    for dummy_col in dummies:
        print '%s: %d' % (dummy_col, np.sum(dummies[dummy_col]))
    print
    dest_df = dest_df.join(dummies)
    return dest_df
Process dummy variables for the 'Channel' and 'Region' features. Drop the original categorical feature, plus one of the dummy variables from each set for 'leave one out' encoding (this avoids redundant, perfectly collinear columns).
In [125]:
df = get_dummies(df, df, 'Channel')
df.drop(['Channel', 'Channel_2'], axis=1, inplace=True)
df = get_dummies(df, df, 'Region')
df.drop(['Region', 'Region_3'], axis=1, inplace=True)
df.rename(index=str, columns={'Channel_1': 'Channel_Horeca', 'Region_1': 'Region_Lisbon', 'Region_2': 'Region_Oporto'},
inplace=True)
print df.head()
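As an aside (not part of the original run), pandas 0.18+ can do a similar encoding in one call; note that drop_first drops the first level of each variable rather than the specific levels chosen above, so the retained columns differ. A minimal sketch on a fresh copy of the raw data:
In [ ]:
# Hedged alternative: encode both columns and drop one level each in one call.
# Uses a fresh copy of the CSV, since the working df is already transformed here.
raw = pd.read_csv('Wholesale customers data.csv')
alt = pd.get_dummies(raw, columns=['Channel', 'Region'], drop_first=True)
print alt.head()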
Plot a histogram of customer size. It shows a small number of large customers and many smaller ones.
In [126]:
plt.hist(df['Total'], bins=32)
plt.xlabel('Total Purchases')
plt.ylabel('Number of Customers')
plt.title('Histogram of Customer Size')
plt.show()
plt.close()
Scale the data so no category dominates due to numeric scale.
In [127]:
sc = StandardScaler()
sc.fit(df)
X = sc.transform(df)
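As a quick sanity check (my addition), the scaled columns should now have a mean of roughly 0 and a standard deviation of roughly 1:
In [ ]:
# After standardization each column should have mean ~0 and std ~1,
# up to floating point error.
print 'Column means (first 3): %s' % np.round(X.mean(axis=0)[:3], 6)
print 'Column stds  (first 3): %s' % np.round(X.std(axis=0)[:3], 6)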
Set up a plotting function for K Means output.
In [128]:
def plot_kmeans(pred, centroids, x_name, y_name, x_idx, y_idx, k):
    # Scatter the original (unscaled) data, one color/marker per cluster.
    for i in range(0, k):
        plt.scatter(df[x_name].loc[pred == i], df[y_name].loc[pred == i], s=6,
                    c=colors[i], marker=markers[i], label='Cluster %d' % (i + 1))
    # Mark the centroids passed in by the caller (already inverse-transformed).
    plt.scatter(centroids[:, x_idx], centroids[:, y_idx],
                marker='x', s=180, linewidths=3,
                color='k', zorder=10)
    plt.xlabel(x_name)
    plt.ylabel(y_name)
    plt.legend()
    plt.show()
    plt.close()
Set up some markers and colors.
In [129]:
markers = ('s', 'o', 'v', '*', 'D', '+', 'p', '<', '>', 'x')
colors = ('C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9')
Try K Means with k = 3 (mainly because when I previously studied this data using R, k = 3 produced interesting results). The value of k will be set from a distortion graph later in the notebook.
In [130]:
k = 3
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
pred = kmeans.predict(X)
for i in range(0, k):
    x = len(pred[pred == i])
    print 'Cluster %d has %d members' % ((i + 1), x)
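Note that this cell does not fix a random seed, so cluster numbering and membership can vary between runs. A hedged variant (not used above) that pins the initialization via random_state:
In [ ]:
# Fixing random_state makes the run repeatable; the original cell
# relies on the default random initialization instead.
kmeans_repeatable = KMeans(n_clusters=k, random_state=555).fit(X)
print 'Repeatable inertia: %.2f' % kmeans_repeatable.inertia_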
Use an inverse transform on the centroids. This is needed because they were calculated on the scaled data, and we want them to plot correctly against the original data.
In [131]:
centroids = sc.inverse_transform(kmeans.cluster_centers_)
In [132]:
plot_kmeans(pred, centroids, 'Frozen', 'Detergents_Paper', 3, 4, k)
Commentary: the presence of a few large customers makes plots with more clusters hard to interpret, because the mass of smaller customers is pushed into the lower right corner of the graph above. Though large-customer behavior is interesting, especially in a business setting, here I will focus on smaller customers. A look at the histogram above suggests a total sales level of 75,000 euros as a reasonable break point. As an aside, when I tried this in R, the K Means algorithm tended to produce two large-customer clusters and one for the great mass of smaller customers. The plot above was chosen because it is representative of a pattern in the data, namely that there are customers who buy one type of product from our wholesale client but do not buy other types. Keeping this Jupyter notebook to a reasonable size precludes showing graphs for every combination of product pairs.
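To make the 75,000 euro break point concrete, a quick count (my addition) shows how many customers it would exclude:
In [ ]:
# How many customers fall above the proposed 75,000 euro break point?
n_large = len(df.loc[df['Total'] > 75000])
print '%d of %d customers exceed 75,000 in total sales' % (n_large, len(df))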
Set up the data structures for the smaller customers. We need to rescale the data because the larger customers have been removed.
In [133]:
df = df.loc[df['Total'] <= 75000]
sc = StandardScaler()
sc.fit(df)
X = sc.transform(df)
Plot a distortion or elbow graph to help choose an optimal value for k.
In [134]:
K = range(1, 20)
mean_distortions = []
for k in K:
    np.random.seed(555)
    kmeans = KMeans(n_clusters=k, init='k-means++')
    kmeans.fit(X)
    mean_distortions.append(
        sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, mean_distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.show()
plt.close()
Distortion plots often have elbow points that are difficult to interpret, but k = 6 looks like it might make for an interesting set of clusters and plots.
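When the elbow is ambiguous, silhouette analysis is a common complementary heuristic. Here is a hedged sketch, not part of the original workflow, using sklearn's silhouette_score; higher average values suggest better-separated clusters:
In [ ]:
from sklearn.metrics import silhouette_score

# Average silhouette width for a few candidate values of k.
# A rough guide only, not a definitive choice of k.
for cand_k in range(2, 9):
    np.random.seed(555)
    labels = KMeans(n_clusters=cand_k).fit_predict(X)
    print 'k = %d, silhouette = %.3f' % (cand_k, silhouette_score(X, labels))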
In [135]:
np.random.seed(555)  # fix the seed so the run is repeatable
k = 6
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
pred = kmeans.predict(X)
for i in range(0, k):
    x = len(pred[pred == i])
    print 'Cluster %d has %d members' % ((i + 1), x)
Plot K Means result with centroids.
In [136]:
centroids = sc.inverse_transform(kmeans.cluster_centers_)
plot_kmeans(pred, centroids, 'Frozen', 'Detergents_Paper', 3, 4, k)
The pattern seen in the full data set also shows up amongst the customers with less than 75,000 euros in total sales. There are, for example, a number of customers who buy detergents and paper from our client but not frozen goods, and the inverse also appears to be true. A business domain expert could be consulted to determine whether this is characteristic of the needs of these customers, or whether there are missed opportunities for our client's marketing department.
The pattern is especially apparent in clusters 1, 3 and 5; let's see what we can learn about these clusters.
A function to print cluster data.
In [137]:
def print_cluster_data(cluster_number):
    # Report channel mix, region mix and average size for one cluster.
    # cluster_number is 1-based, matching the text; pred labels are 0-based.
    print '\nData for cluster %d' % cluster_number
    cluster = df.loc[pred == cluster_number - 1, :]
    num_in_cluster = float(len(cluster.index))
    num_horeca = float(np.sum(cluster['Channel_Horeca']))
    num_retail = num_in_cluster - num_horeca
    print 'Percent Horeca: %.2f, Percent Retail: %.2f' % \
        (num_horeca / num_in_cluster * 100.0, num_retail / num_in_cluster * 100.0)
    num_lisbon = float(np.sum(cluster['Region_Lisbon']))
    num_oporto = float(np.sum(cluster['Region_Oporto']))
    num_other = num_in_cluster - num_lisbon - num_oporto
    print 'Percent Lisbon: %.2f, Percent Oporto: %.2f, Percent Other: %.2f' % \
        (num_lisbon / num_in_cluster * 100.0, num_oporto / num_in_cluster * 100.0,
         num_other / num_in_cluster * 100.0)
    avg_cust_size = np.sum(cluster['Total']) / num_in_cluster
    print 'Average Customer Size is: %.2f for %d Customers' % (avg_cust_size, num_in_cluster)
Print out data for clusters 1, 3 and 5.
In [138]:
print_cluster_data(cluster_number=1)
print_cluster_data(cluster_number=3)
print_cluster_data(cluster_number=5)
Commentary: clusters 3 and 5 are dominated by retail customers. In an actual business setting we might want to investigate why they buy lots of detergents and paper from our client but little in the way of frozen goods. Cluster 1 represents customers from the "Horeca" (hotel, restaurant, cafe) distribution channel. These customers tend to buy frozen goods but not detergents and paper, perhaps because they sell food and use detergents and paper only as maintenance supplies. Clustering can help in a market segmentation analysis by identifying types and groups of customers.
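To put rough numbers behind this reading, per-cluster averages for the two products can be printed (my addition; cluster numbers are 1-based as in the text):
In [ ]:
# Average Frozen vs Detergents_Paper spend per cluster, to quantify
# the cross-buying pattern described above.
for i in range(0, k):
    cluster = df.loc[pred == i, :]
    print 'Cluster %d: mean Frozen %.0f, mean Detergents_Paper %.0f' % \
        (i + 1, cluster['Frozen'].mean(), cluster['Detergents_Paper'].mean())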
I hope that you have enjoyed my example of using Python and K Means to identify some of the patterns in the UCI Wholesale Customers data set.